Secondary structure prediction tools
In a Foldit "de-novo" puzzle, players are given a fixed sequence of amino acids, presented as a straight "extended chain". Unlike design puzzles, which also start with an extended chain, de-novo puzzles do not allow mutation. Also unlike a design puzzle, a de-novo puzzle typically has some secondary structures (helixes or sheets) defined. The puzzle comments typically state that the secondary structure predictions are "from PSIPRED".

The subject of secondary structure predictions came up in #veteran chat on 8 January 2017 (UTC-6). An edited version of the chat log appears below.

Background

Some general background on the topics discussed in the chat may be helpful.

Amino acid sequence and secondary structure notation in Foldit

The amino acid sequence (or "primary structure") of a Foldit puzzle is typically represented as a string of single-character amino acid codes. Recent Foldit puzzles typically have the sequence on the web page. For example, for Puzzle 1326, the sequence is:

TDDFREELKKMLKEYKRHSQEHYRSSRSTDDGRTSTEVRYDHDNGTSEVRSTSDNGDEEIRKQLKEMKKELKKQG

This style is often referred to as "Fasta format". (Fasta has many variations; often there's a short header that gives the sequence a name.) While the sequence shown here is in upper case, Foldit functions, for example structure.GetAminoAcid and structure.SetAminoAcid, use lowercase.

Many Foldit recipes use a similar format for secondary structure. The Foldit standard is to use "H" for helix, "E" for sheet, and "L" for loop. The starting secondary structure for Puzzle 1326, in this format, is:

LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL

Other tools may use "-" or a blank space for loop. And just to keep things confusing, sheets are sometimes called "strands", and "coil" may be used instead of "loop". A "coiled coil", on the other hand, is something else again: two or more helixes twisted together, as seen in puzzle 479.
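The notation conventions described in this section can be sketched in a few lines of Python. This is only an illustration, not part of Foldit or any existing recipe: the function names strip_fasta_header and to_foldit_ss are hypothetical, the example header is made up, and the mapping of "C", "-", and blank to "L" is simply the convention described above.

```python
def strip_fasta_header(text: str) -> str:
    """Return only the sequence letters from Fasta-style text.

    Lines starting with ">" are treated as header lines and skipped;
    the remaining lines are stripped of whitespace and joined.
    """
    lines = [line.strip() for line in text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


# Assumed mapping, based on the conventions described above:
# "C" (coil), "-" and a blank space all stand for loop.
TO_FOLDIT = {"H": "H", "E": "E", "L": "L", "C": "L", "-": "L", " ": "L"}


def to_foldit_ss(prediction: str) -> str:
    """Translate a secondary structure string to Foldit's H/E/L letters.

    Any letter outside the assumed mapping raises KeyError, so an
    unexpected notation is noticed rather than silently passed through.
    """
    return "".join(TO_FOLDIT[ch] for ch in prediction.upper())


if __name__ == "__main__":
    # First 20 residues of the Puzzle 1326 sequence, with a made-up header.
    fasta = ">example header (hypothetical)\nTDDFREELKK\nMLKEYKRHSQ\n"
    sequence = strip_fasta_header(fasta)
    print(sequence)          # upper case, as usually published
    print(sequence.lower())  # lower case, as the Foldit functions use
    print(to_foldit_ss("CHHHHCC--EEE "))
```

Raising an error on unknown letters is a deliberate choice here, since different prediction tools use slightly different alphabets.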
Foldit recipes that work with amino acid sequence and secondary structure

The Foldit recipe Print Protein 2.4 prints the amino acid sequence and secondary structure in the format shown above. For convenience, both structures are also presented for copy and paste.

The Foldit recipes AA Edit 1.2 and SS Edit 1.2 show the current amino acid and secondary structure sequence, and allow the user to paste in new sequences.

The recipe AA Copy Paste Compare v 1.1.1 -- Brow42 combines both amino acid and secondary structure display and change in one recipe.

Tools mentioned in the chat

The chat mentioned several tools that predict secondary structure and other aspects of a fold based on the amino acid sequence. These tools are available online, and accept the simple "Fasta" format shown above for the input sequence.

The first tool is PSIPRED, which is used to produce the secondary structure prediction of most Foldit de-novos. One of PSIPRED's outputs is similar to the secondary structure format shown above.

Another popular tool is Jpred, which produces several predictions of the secondary structure based on the amino acid sequence. Jpred also attempts to find any matching or similar sequences for published proteins. Jpred's main predictions for secondary structure are similar to the format shown above.

The chat also mentioned NetSurfP, which produces secondary structure predictions as probabilities for each segment. This led to the Foldit recipe NetSurfP 1.0, which converts NetSurfP output into the secondary structure format shown above (and also reformats the NetSurfP output so it can be more easily pasted into a spreadsheet).

Finally, NetTurnP is closely related to NetSurfP, but produces a segment-by-segment analysis of where there are likely to be turns. A Foldit recipe to digest NetTurnP output is no doubt forthcoming.

Comparison of predictions

The prediction tools described above were compared for Puzzle 1326.
PSIPRED

One version of the PSIPRED prediction is a simple text file:

# PSIPRED HFORMAT (PSIPRED V3.3)

               1         2         3         4         5         6         7
      123456789012345678901234567890123456789012345678901234567890123456789012345
Conf: 915999999999999851688752057789998400155416887210011678872999999999999997439
Pred: CHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCEEEEEECCCCCCCEECCCCCCHHHHHHHHHHHHHHHHHCC
  AA: TDDFREELKKMLKEYKRHSQEHYRSSRSTDDGRTSTEVRYDHDNGTSEVRSTSDNGDEEIRKQLKEMKKELKKQG

The secondary structure prediction is:

CHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCEEEEEECCCCCCCEECCCCCCHHHHHHHHHHHHHHHHHCC

or, translated into Foldit notation:

LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL

This is a little different from the start for Puzzle 1326:

LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (PSIPRED)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)

PSIPRED starts the first helix one segment earlier and gives the first sheet six segments instead of five. The difference suggests that PSIPRED was probably run with different settings when Puzzle 1326 was set up. The tool has many different modes and options. Only the default mode was used for this analysis. Some of the modes are proprietary and require a license key to run.

Jpred

The main Jpred prediction for the sequence from Puzzle 1326 is:

OrigSeq TDDFREELKKMLKEYKRHSQEHYRSSRSTDDGRTSTEVRYDHDNGTSEVRSTSDNGDEEIRKQLKEMKKELKKQG
Jnet    --HHHHHHHHHHHHHHHHHHH---------------EEEEE------EEEE-----HHHHHHHHHHHHHHHHH--
jhmm    --HHHHHHHHHHHHHHHHHHH---------------EEEEE------EEEE-----HHHHHHHHHHHHHHHHH--

Jnet and jhmm are two different prediction methods, but here they produced the same results. Converted to Foldit style, here's the comparison to the Puzzle 1326 start:

LLHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLEEEEELLLLLLEEEELLLLLHHHHHHHHHHHHHHHHHLL (Jpred)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)

Jpred predicts that the initial helix is shorter than shown at the start of Puzzle 1326.
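Comparisons like the ones above can also be made mechanically. The following Python sketch lists the positions where two secondary structure strings disagree. The function name diff_positions is hypothetical, and the two strings are the PSIPRED prediction and the Puzzle 1326 start as given above; positions are counted from 1, the way Foldit numbers segments.

```python
from itertools import zip_longest


def diff_positions(first: str, second: str):
    """Return (position, first_letter, second_letter) for each mismatch.

    Positions are 1-based. If the strings differ in length, the
    missing letters of the shorter one are shown as "?".
    """
    pairs = zip_longest(first, second, fillvalue="?")
    return [(i, a, b) for i, (a, b) in enumerate(pairs, start=1) if a != b]


# Strings copied from the PSIPRED comparison above.
psipred = "LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL"
puzzle_start = "LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL"

for position, predicted, start in diff_positions(psipred, puzzle_start):
    print(f"segment {position}: PSIPRED {predicted}, puzzle start {start}")
```

The same function can be applied to any pair of predictions once they have been converted to Foldit's H/E/L letters.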
NetSurfP

The NetSurfP output was reduced by the Foldit recipe NetSurfP 1.0:

LLHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLHHHHHHHHHHHHHHHHHLL (NetSurfP)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)

Combined

The three slightly different predictions and the puzzle start, combined in one box:

LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (PSIPRED)
LLHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLEEEEELLLLLLEEEELLLLLHHHHHHHHHHHHHHHHHLL (Jpred)
LLHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLHHHHHHHHHHHHHHHHHLL (NetSurfP)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)

As Susume mentions in the chat, all these tools probably have a similar weak spot, which is predicting sheets on the outside of a protein.

Puzzle 1326 is likely a protein originally designed by Foldit players. Foldit designs often have a relatively flat section of two or more sheets opposite one or more helixes. This is referred to as the "hotdogs and surf board" model in the chat, where the hotdogs are the helixes and the surfboard is the sheets.

In designs of this type, the sheets on the outer edge of the surfboard tend to have a lot of hydrophobic residues on both the "outer" and "inner" (helix-facing) sides. The prediction services seem to have difficulty guessing that these hydrophobic sequences form sheets.

The Chat

Here is the #veteran chat that discussed all these tools. The chat has been lightly edited to remove some interspersed conversations and correct a few typos. All times are UTC-6, or US Central Standard Time.

Addendum

*In The Age of Doubt by Andrea Camilleri (Penguin Books, 2012, ISBN 978-0-14-312092-6), Montalbano tries to relay the words "Kimberly Process" to Catarella over the phone. "Once they got past the stumbling block at the K, there was still the Y at the end."
The base Italian alphabet consists of 21 letters and does not include K (lysine), Y (tyrosine), or W (tryptophan). You'd have to know Vigata to get the rest.